Abstract: Consider the set of source distributions within a
fixed maximum relative entropy with respect to a given nominal 
distribution. Lossless source coding over this relative entropy ball 
can be approached in more than one way. A problem previously 
considered is finding a minimax average length source code. The 
minimizing players are the codeword lengths — real numbers 
for arithmetic codes, integers for prefix codes — while the max- 
imizing players are the uncertain source distributions. Another 
traditional minimizing objective is the first one considered here, 
maximum (average) redundancy. This problem reduces to an 
extension of an exponential Huffman objective treated in the 
literature but heretofore without direct practical application. 
In addition to these, this paper examines the related problem 
of maximal minimax pointwise redundancy and the problem 
considered by Gawrychowski and Gagie, which, for a sufficiently 
small relative entropy ball, is equivalent to minimax redundancy. 
One can consider both Shannon-like coding based on optimal real 
number ("ideal") codeword lengths and a Huffman-like optimal 
prefix coding. 

Index Terms: Source Coding, Minimax Coding, Relative Entropy,
Redundancy.

I. Introduction 

A. Preliminaries 

The well-known problem of finding a uniquely decod- 
able code with minimum average codeword length over a 
memoryless source gives rise to the optimal Huffman code 
and the near-optimal Shannon code. The derivation of the 
latter code assures that average redundancy of neither code 
exceeds 1. However, suppose the true distribution of the source 
is unknown and the code is designed solely with respect to a 
nominal distribution, for example one derived from access to 
limited empirical data. In this case the relative entropy between 
the nominal distribution and the true distribution appears in the 
redundancy bounds of average codeword length; indeed, the 
lower bound is the sum of the entropy and the divergence 
[1, Theorem 5.4.3]. Consequently, under such uncertainty, the
nominally optimal codes will be robust neither in average 
length nor in redundancy. 

Suppose the nominal or approximate distribution of a source
is $\mu$, while the unknown or true probability distribution of the
source is any $\nu$ which is absolutely continuous with respect
to $\mu$ and satisfies a relative entropy constraint. This constraint
is $D(\nu\|\mu) \le R$, where $R$ is a known, positive entropy value
(in nats) and

$$D(\nu\|\mu) = \sum_{i} \nu_i \log \frac{\nu_i}{\mu_i},$$

where the sum is over all events and log is the natural
logarithm.
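As a minimal sketch of the constraint just defined, the following Python computes the relative entropy in nats and tests membership in the ball. The function names and the numerical values of `mu` and `nu` are illustrative assumptions, not taken from the paper.

```python
import math

def relative_entropy(nu, mu):
    """Return D(nu || mu) in nats, or math.inf if nu is not
    absolutely continuous with respect to mu."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n == 0.0:
            continue          # convention: 0 * log(0 / m) = 0
        if m == 0.0:
            return math.inf   # nu puts mass where mu has none
        total += n * math.log(n / m)
    return total

def in_relative_entropy_ball(nu, mu, radius):
    """True if nu satisfies the constraint D(nu || mu) <= radius."""
    return relative_entropy(nu, mu) <= radius

mu = [0.5, 0.25, 0.25]   # nominal distribution (illustrative values)
nu = [0.4, 0.3, 0.3]     # candidate true distribution (illustrative values)
print(relative_entropy(nu, mu))                # about 0.0201 nats
print(in_relative_entropy_ball(nu, mu, 0.05))  # True
```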

Source coding problems are often idealized so that code- 
word lengths need not be integer; this is useful for arithmetic 
codes, for calculating limits for arbitrarily large blocks, and 
for finding approximate solutions, and, by extension, per- 
formance bounds. Such approximate solutions often require 
less computation and/or memory to compute than optimal 
solutions, and these robust Shannon-like codes will be our 
intended application of the relaxation of $\ell_i \in \mathbb{Z}_+$ to $\ell_i \in \mathbb{R}$
(with no further mention of arithmetic codes and other related 
applications). 

We can assume the $M$-member input alphabet is, without
loss of generality, equal to $\mathcal{X} = \{1, 2, \ldots, M\}$. Denote the
class of all uniquely decodable codes defined on $\mathcal{X}$ by $\mathcal{C}(\mathcal{X})$
or $\mathcal{C}_M$. We use codes and codeword length ensembles
interchangeably, since one can easily be derived from the other;
thus valid solutions can be said to satisfy $\ell \in \mathcal{C}(\mathcal{X})$ for random
vector $\ell$. An ensemble of lengths corresponds to a uniquely
decodable code if and only if all values are natural numbers
and satisfy the Kraft inequality $\sum_{i=1}^{M} D^{-\ell_i} \le 1$. Denoting
the set of real vectors which satisfy the Kraft inequality as
$\mathcal{K}(\mathcal{X})$, we can mathematically restate the previous sentence
as $\mathcal{C}(\mathcal{X}) = \mathcal{K}(\mathcal{X}) \cap \mathbb{Z}_+^M$, where $\mathbb{Z}_+^M$ is the set of all $M$-vectors
of positive integers.
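The two feasibility sets above can be checked mechanically. The sketch below is one way to do so, with hypothetical function names; it uses exact rational arithmetic for the integer case and a small floating-point tolerance (an assumption of this sketch) for the real-valued relaxation.

```python
from fractions import Fraction

def kraft_sum(lengths, D=2):
    """Exact Kraft sum for integer codeword lengths over a D-ary alphabet."""
    return sum(Fraction(1, D ** l) for l in lengths)

def is_uniquely_decodable(lengths, D=2):
    """True if the lengths are positive integers satisfying the Kraft inequality."""
    if not all(isinstance(l, int) and l >= 1 for l in lengths):
        return False
    return kraft_sum(lengths, D) <= 1

def in_kraft_set(lengths, D=2, tol=1e-12):
    """Real-valued relaxation: Kraft inequality only, no integer requirement."""
    return sum(D ** (-l) for l in lengths) <= 1.0 + tol

print(is_uniquely_decodable([1, 2, 2]))   # True: 1/2 + 1/4 + 1/4 = 1
print(is_uniquely_decodable([1, 1, 2]))   # False: Kraft sum is 5/4
print(in_kraft_set([1.5, 1.5]))           # True, but not an integer code
```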

B. Objectives 

There are several possible approaches to such a source 
coding problem. 

(Average) Minimax Length Approach: This approach, explored
in [2], is concerned with the minimax average length
formulation which, with a relative entropy constraint, is

$$\inf_{\ell^{\dagger} \in \mathcal{K}(\mathcal{X})} \ \sup_{\{\nu :\, D(\nu\|\mu) \le R\}} E_{\nu}(\ell^{\dagger}) \tag{1}$$

for the robust Shannon code, obtained by rounding the real-valued
lengths $\ell^{\dagger}$ up to integers, and

$$\inf_{\ell^{*} \in \mathcal{C}(\mathcal{X})} \ \sup_{\{\nu :\, D(\nu\|\mu) \le R\}} E_{\nu}(\ell^{*}) \tag{2}$$

for robust Huffman code $\ell^{*}$, where $E_{\nu}$ denotes expectation
with respect to the distribution $\nu$. The main objective of the
minimax formulation is to encode the output of uncertain
sources using the worst case distribution $\nu^{*}$, i.e., the
distribution which maximizes the average length as a function of the
nominal distribution $\mu$. By judging the coding of the uncertain
source according to $\nu^{*}$, the resulting code will be robust in
the sense of average length over the set of uncertain sources
which satisfy the relative entropy constraint.

(Average) Minimax Redundancy Approach: This approach,
explored in Section II, finds codes robust in the sense of
average redundancy. The minimax redundancy (or average
minimax redundancy) formulation is

$$\inf_{\ell^{AL} \in \mathcal{C}(\mathcal{X})} \ \sup_{\{\nu :\, D(\nu\|\mu) \le R\}} \left( E_{\nu}(\ell^{AL}) - H_D(\nu) \right) \tag{3}$$

for the Huffman-style problem, where $H_D(\nu)$ is the entropy
of the source with distribution $\nu$ in terms of compressed
$D$-ary symbols. The Shannon counterpart is trivially the Shannon
code $\ell^{\mathrm{Shannon}}_i = \lceil -\log_D \mu_i \rceil$ for $\mu$, since ideal codeword length
$\ell'_i = -\log_D \mu_i$ has an expectation equal to entropy plus
relative entropy [1, Section 13.1]. Because of this, in the
idealized problem, no matter what the domain for $\nu$, it is
sufficient to find the smallest relative entropy ball that contains
the set and use the optimal code around which the entropy
ball is centered. Thus, a divergence entropy ball is a "natural
set" [1, Section 13.1], also seen in past treatments of optimal
codes and related problems [3]-[5]. This is closely related
to the method of types, in which relative entropy of observed
distribution $\nu$ relative to actual distribution $\mu$ is closely related
to the probability of a given type JU, Q.
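The statement that the Shannon code for $\mu$ has expected length equal to entropy plus relative entropy, up to the rounding penalty of less than one symbol, can be checked numerically. The sketch below is illustrative only: the helper names, the small rounding guard, and the example distributions are assumptions of this sketch, not part of the paper.

```python
import math

def shannon_lengths(mu, D=2):
    """Shannon code lengths ceil(-log_D mu_i) for the nominal distribution."""
    # The 1e-12 guard keeps exact powers of D from rounding up due to float error.
    return [math.ceil(-math.log(m) / math.log(D) - 1e-12) for m in mu]

def expected_length(nu, lengths):
    return sum(n * l for n, l in zip(nu, lengths))

def entropy(nu, D=2):
    return -sum(n * math.log(n) for n in nu if n > 0) / math.log(D)

def relative_entropy(nu, mu, D=2):
    """Relative entropy in base-D units; assumes mu_i > 0 wherever nu_i > 0."""
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0) / math.log(D)

mu = [0.6, 0.3, 0.1]   # nominal distribution (illustrative values)
nu = [0.5, 0.3, 0.2]   # true distribution (illustrative values)
lengths = shannon_lengths(mu)              # [1, 2, 4]
avg = expected_length(nu, lengths)
ideal = entropy(nu) + relative_entropy(nu, mu)
print(lengths, avg, ideal)
# Average length lies within one symbol above entropy plus divergence.
assert ideal <= avg < ideal + 1
```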

Gawrychowski-Gagie Approach [8]: This approach, like the
average redundancy, is based on the premise of minimizing the
average amount by which codeword lengths exceed what they
"should." In this case, rather than using the amount average
codeword length exceeds the entropy of $\nu$ (the ideal average
length with knowledge of $\nu$), we use the amount it exceeds
the entropy of $\nu$ plus the relative entropy of $\nu$ with respect to
$\mu$ (the ideal average length without knowledge of $\nu$):

$$\inf_{\ell^{GG} \in \mathcal{C}(\mathcal{X})} \ \sup_{\{\nu :\, D(\nu\|\mu) \le R\}} \left( E_{\nu}(\ell^{GG}) - H_D(\nu) - D_D(\nu\|\mu) \right) \tag{4}$$

for the Huffman-style problem, where $D_D(\nu\|\mu) =
(\log D)^{-1} D(\nu\|\mu)$, the relative entropy using base $D$. Again,
the Shannon version is trivial. However, unlike the other
utilities discussed, here the optimized value includes a term
from the nominal distribution $\mu$; we look at this utility in
association with the previous one in Section II.

Maximal Minimax Redundancy Approach: This approach,
considered in Section III, is concerned with the maximal
minimax (pointwise) redundancy formulation. This type of
redundancy takes the maximum difference between $\ell_i$ and
$-\log_D \nu_i$ rather than the expected difference. This
results in

$$\inf_{\ell \in \mathcal{C}(\mathcal{X})} \ \sup_{\{\nu :\, D(\nu\|\mu) \le R\}} \ \max_{k \in [1,M]} \left( \ell_k + \log_D \nu_k \right) \tag{5}$$

in the Huffman case and

$$\inf_{\ell \in \mathcal{K}(\mathcal{X})} \ \sup_{\{\nu :\, D(\nu\|\mu) \le R\}} \ \max_{k \in [1,M]} \left( \ell_k + \log_D \nu_k \right) \tag{6}$$

in the real case from which a Shannon-like code can be
derived. We investigate (3)-(6) here.

C. Literature Review 

There is a significant literature dealing with source coding 
for an unknown source when either the empirical distribution 
of the source is available or the uncertainty is modeled through 
certain unknown parameters [9]-[11]. A good survey of the
literature is found in [12]. Uncertainty modeling using relative
entropy has been considered in [13], in which the problem of
coding for only one unknown source is addressed. However,
unlike [12] and [13], here we are dealing with the problem
of source coding for a class containing many sources which 
satisfy the relative entropy constraint. Our modeling assumes 
a knowledge of the nominal distribution and the uncertainty 
radius R. Universal modeling using relative entropy has been 
discussed in [4], where the tightest upper bound for the relative
entropy between empirical distributions (of available training 
sequences), and a nominal distribution is found. The nominal 
distribution is itself computed as part of a search algorithm. 
Longo and Galasso considered relative entropy balls in which 
the same code was optimal for all probability mass functions 
[3]. Universal utility is taken over the entire simplex in
[8], while it is considered for arbitrary sets in [14] and [15].

Finally, we point out that the current minimax source coding 
formulation may be generalized to applications in which the 
nominal distribution is parameterized as in [12]. In this case,
the methods found in [12] which employ maximum likelihood
techniques can be invoked to estimate the parameters of the 
nominal distribution. Additional generalizations may include 
situations in which the source is described by ergodic finite- 
state Markov chains. 

II. Minimax Redundancy Coding 

Let $\mathcal{M}(\mathcal{X})$ denote the set of all probability mass functions
defined on $\mathcal{X}$. Nominal probability distribution $\mu \in \mathcal{M}(\mathcal{X})$
is known; we assume that any (unknown) true probability
distribution $\nu$ is absolutely continuous with respect to $\mu$ in
$\mathcal{M}(\mathcal{X})$, i.e., if $\mu_i = 0$ for any $i \in \mathcal{X}$, then $\nu_i = 0$. In
addition, we know non-negative $R$ defining the domain of
possible solutions, $\mathcal{M}_R = \{\nu \in \mathcal{M}(\mathcal{X}) : D(\nu\|\mu) \le R\}$. Given
nominal $\mu \in \mathcal{M}(\mathcal{X})$ and non-negative $R$, the problem is to
find the codeword lengths $\{\ell_i^{*}\}$ of a uniquely decodable code
and a probability measure $\nu^{*} \in \mathcal{M}_R$ which solve the following
generalized minimax source coding problem:



$$J_f(\ell^{*}, \nu^{*}) = \inf_{(\ell_1,\ldots,\ell_M)} \ \sup_{\nu \in \mathcal{M}(\mathcal{X})} f(\ell, \nu) \quad \text{subject to} \quad D(\nu\|\mu) \le R, \quad \sum_{i=1}^{M} D^{-\ell_i} \le 1, \tag{7}$$



where $f$ is the utility in question. Values $\ell_i$ are the lengths
of the $D$-ary codewords. Encoding based on the worst case
measure $\nu^{*} \in \mathcal{M}_R$ results in average codeword length being
less sensitive to different source distributions within set $\mathcal{M}_R$.



Recall

$$\inf_{\ell^{AL} \in \mathcal{C}(\mathcal{X})} \ \sup_{\{\nu :\, D(\nu\|\mu) \le R\}} \left( E_{\nu}(\ell^{AL}) - H_D(\nu) \right) \tag{3}$$

and the closely related

$$\inf_{\ell^{GG} \in \mathcal{C}(\mathcal{X})} \ \sup_{\{\nu :\, D(\nu\|\mu) \le R\}} \left( E_{\nu}(\ell^{GG}) - H_D(\nu) - D_D(\nu\|\mu) \right), \tag{4}$$

two specific instances of (7). The former, minimax redundancy
(or average minimax redundancy), is one of the most widely
used measures of performance for designing codes when the
source is subject to uncertainty (e.g., [9], [10], and [16]). As
it turns out, the minimax solution to this problem leads to
encoding with an exponential pay-off similar, but not identical,
to (2) in [2]. This encoding, where it applies, also solves (4).

Recall $\{\ell_1, \ldots, \ell_M\}$ denotes codeword lengths for the
source symbols $\{1, \ldots, M\}$. Assume $\nu \in \mathcal{M}_R$, which implies
$D(\nu\|\mu) \le R$. Let $r(\ell, \nu)$ denote the redundancy of the code.
Formulate the problem of average minimax redundancy as:

$$\inf_{(\ell_1,\ldots,\ell_M)} \ \sup_{\nu \in \mathcal{M}_R} \left( E_{\nu}(\ell) - H_D(\nu) \right) = \inf_{(\ell_1,\ldots,\ell_M)} \ \sup_{\nu \in \mathcal{M}_R} r(\ell, \nu).$$

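Average redundancy can be written as follows:

$$r(\ell, \nu) = \sum_{i=1}^{M} \nu_i \ell_i + \sum_{i=1}^{M} \nu_i \log_D \nu_i = \frac{1}{\log D} \left( D(\nu\|\mu) + \sum_{i=1}^{M} \nu_i \log \frac{\mu_i}{\theta_i} \right) \tag{8}$$

where $\theta_i = D^{-\ell_i}$ for all $i \in \{1, \ldots, M\}$. Now consider the
following probability distribution.

$$\nu^{\circ}_i(\beta) = \frac{\mu_i \left( \mu_i / \theta_i \right)^{\beta}}{\sum_{k=1}^{M} \mu_k \left( \mu_k / \theta_k \right)^{\beta}}, \qquad \beta > 0 \tag{9}$$

The relative entropy between $\nu$ and $\nu^{\circ}(\beta)$ can be found as
follows.

$$D(\nu\|\nu^{\circ}(\beta)) = D(\nu\|\mu) + \log \sum_{i=1}^{M} \mu_i \left( \frac{\mu_i}{\theta_i} \right)^{\beta} - \beta \sum_{i=1}^{M} \nu_i \log \frac{\mu_i}{\theta_i} \tag{10}$$

Now substitute $\sum_{i=1}^{M} \nu_i \log \frac{\mu_i}{\theta_i}$ from (10) in (8). Then

$$r(\ell, \nu) = \frac{1}{\log D} \left( \frac{1+\beta}{\beta} D(\nu\|\mu) - \frac{1}{\beta} D(\nu\|\nu^{\circ}(\beta)) + \frac{1}{\beta} \log \sum_{i=1}^{M} \mu_i \left( \frac{\mu_i}{\theta_i} \right)^{\beta} \right) \tag{11}$$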



The supremum of (11) is attained at $\nu = \nu^{\circ}(\beta)$, where $\beta$ is the
value such that $D(\nu^{\circ}(\beta)\|\mu) = R$, for those problems in which
such a $\beta > 0$ exists. This maximizes $\beta$ within the relative
entropy constraint, thus maximizing the first term, bringing the
second term to zero, and maximizing the third term (due to
Lyapunov's inequality for moments, an application of Hölder's
inequality, e.g., [17, p. 27] or [18, p. 54]). Hence,

$$\sup_{\nu \in \mathcal{M}_R} r(\ell, \nu) = r(\ell, \nu^{\circ}(\beta)). \tag{12}$$
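For fixed codeword lengths, the value of $\beta$ with $D(\nu^{\circ}(\beta)\|\mu) = R$ can be located numerically. The sketch below assumes the reading of (9) in which $\nu^{\circ}_i(\beta)$ is proportional to $\mu_i(\mu_i/\theta_i)^{\beta}$ with $\theta_i = D^{-\ell_i}$, assumes all $\mu_i > 0$, and uses plain bisection; it relies on the divergence being nondecreasing in $\beta$ along this exponentially tilted family. Function names, the cap `beta_max`, and the example values are hypothetical.

```python
import math

def relative_entropy(nu, mu):
    """D(nu || mu) in nats; assumes mu_i > 0 for all i."""
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0.0)

def tilted_distribution(mu, lengths, beta, D=2):
    """nu_i proportional to mu_i * (mu_i / theta_i)**beta, theta_i = D**(-l_i)."""
    log_w = [math.log(m) + beta * (math.log(m) + l * math.log(D))
             for m, l in zip(mu, lengths)]
    top = max(log_w)                       # subtract the max for numerical stability
    w = [math.exp(x - top) for x in log_w]
    z = sum(w)
    return [x / z for x in w]

def solve_beta(mu, lengths, radius, D=2, beta_max=1e6, iters=200):
    """Bisection for beta > 0 on the ball boundary; None if no bracket is found."""
    def gap(beta):
        nu = tilted_distribution(mu, lengths, beta, D)
        return relative_entropy(nu, mu) - radius
    hi = 1.0
    while gap(hi) < 0.0:                   # grow the bracket until the boundary is crossed
        hi *= 2.0
        if hi > beta_max:                  # assumption: treat this as "no such beta"
            return None
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mu = [0.5, 0.3, 0.2]        # illustrative nominal distribution
lengths = [1, 2, 2]         # illustrative binary codeword lengths
beta = solve_beta(mu, lengths, radius=0.05)
print(beta, tilted_distribution(mu, lengths, beta))
```

Returning `None` when the bracket cannot be found is a design choice of this sketch; it corresponds to the situation in which the radius is too large for any finite $\beta$ to reach the boundary.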



Therefore, minimizing the worst case redundancy over
codeword lengths leads to the following subproblem.

$$\inf_{(\ell_1,\ldots,\ell_M)} \frac{1}{\beta} \log_D \sum_{k=1}^{M} \mu_k \left( \frac{\mu_k}{\theta_k} \right)^{\beta} = \inf_{(\ell_1,\ldots,\ell_M)} \frac{1}{\beta} \log_D \sum_{k=1}^{M} \mu_k D^{\beta(\ell_k - \ell'_k)} \tag{13}$$

where $\ell'_k = \log_D (1/\mu_k)$. This is a special case of the general
problem first posed in [19] and also considered in [20] and
[21]. Of these last two, the former gives an optimal solution,
while the latter notes that this solution is closely related to
Shannon coding; for (13), a conventional Shannon code will
not exceed the optimal code by more than one output symbol
per input symbol.

The optimal solution is obtained using the exponential
Huffman coding as described in [2], [20], [22], [23] for the
following cost function:

$$\inf_{(\ell_1,\ldots,\ell_M)} \frac{1}{s} \log \sum_{k=1}^{M} \xi_k e^{s \ell_k}$$

where $s = \beta \log D$ and $\xi$ is a probability distribution given by

$$\xi_i = \frac{\mu_i^{\beta+1}}{\sum_{k=1}^{M} \mu_k^{\beta+1}}.$$



This solution is linear time given sorted $\mu$. The $\beta$ for which
$D(\nu^{\circ}(\beta)\|\mu) = R$ can be found using root-finding methods,
such as those described for (2) in [2].
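For a fixed $\beta$, the exponential Huffman procedure referred to above can be sketched for the binary case as follows. This sketch uses the classical merging rule for exponential costs, in which the two smallest weights $w_1, w_2$ are replaced by $2^{\beta}(w_1 + w_2)$, with unnormalized starting weights $\mu_k^{\beta+1}$ (scaling all weights equally does not change the merges). It uses a heap, so it runs in $O(M \log M)$ rather than the linear time available for sorted input; the function names and example values are hypothetical.

```python
import heapq
import math

def exponential_huffman_lengths(mu, beta):
    """Binary lengths minimizing sum_k mu_k**(beta+1) * 2**(beta*l_k), for beta > 0."""
    a = 2.0 ** beta                        # growth factor per code digit
    # Heap items: (weight, tie-breaker, indices of the leaves under this node).
    heap = [(m ** (beta + 1.0), i, [i]) for i, m in enumerate(mu)]
    heapq.heapify(heap)
    lengths = [0] * len(mu)
    next_id = len(mu)
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)
        w2, _, leaves2 = heapq.heappop(heap)
        merged = leaves1 + leaves2
        for i in merged:
            lengths[i] += 1                # each leaf under the new node moves one level deeper
        heapq.heappush(heap, (a * (w1 + w2), next_id, merged))
        next_id += 1
    return lengths

def exponential_cost(mu, lengths, beta):
    """(1/beta) * log2 of sum_k mu_k**(beta+1) * 2**(beta*l_k), the binary case of (13)."""
    total = sum(m ** (beta + 1.0) * 2.0 ** (beta * l) for m, l in zip(mu, lengths))
    return math.log2(total) / beta

mu = [0.5, 0.25, 0.125, 0.125]             # illustrative dyadic distribution
lengths = exponential_huffman_lengths(mu, beta=1.0)
print(lengths)                             # [1, 2, 3, 3]
print(exponential_cost(mu, lengths, 1.0))  # 0.0 for this dyadic example
```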

As to the matter of whether such a $\beta$ exists, consider
$\beta \to \infty$. (Taking $\beta \to 0$ corresponds to $R = 0$.) As shown
in ED, there exists a $\beta'$ such that, for all $\beta > \beta'$, (13)
is solved by the optimal codeword solution to the minimax
pointwise redundancy problem, the equation itself becoming
$\inf_{\ell} \max_k (\ell_k - \ell'_k)$. Taking the limit of (9), we find the
following maximal probability mass function:

$$\nu^{\infty}_i = \lim_{\beta \to \infty} \nu^{\circ}_i(\beta) = \frac{\mu_i 1^{*}_i}{\sum_{k=1}^{M} \mu_k 1^{*}_k}$$

where $1^{*}_i$ is 1 if $\mu_i/\theta_i = \max_k \mu_k/\theta_k$ and 0 otherwise. (For example,
if this ratio has a unique maximum with index $k$, then $\nu^{\infty}$ is
1 for $k$ and 0 for all other values.) This results in



M 

fe=i 



(14) 



$\nu^{\infty}$ being the maximizer and the two-variable solution of (13)
the minimizer of the minimax problem when $R$ is $D(\nu^{\infty}\|\mu)$.
If $D(\nu^{\infty}\|\mu)$ exceeds $R$, then the desired $\beta$ exists due to
continuity; otherwise, it does not. In the case that it does not,
the simplex boundaries ($\nu^{\circ}_i(\beta) \ge 0$) come into play and the
optimization no longer conforms to our model.

Now recall the utility of Gawrychowski and Gagie [8],
shown at the beginning of this section in (4), which can also
be expressed as $r(\ell, \nu) - (\log D)^{-1} D(\nu\|\mu)$. Note that the
above analysis for $\beta$, for values where it holds, also
holds for this, since the three terms in (11) are identical;
only the multiplicative factors are different, and these do
not affect the optimization. The analysis in the limit also
holds. In this case, the optimal minimax solution in the limit
is the optimal code for all $R \ge -\log \min_i \mu_i$, not just
$R = -\log \min_i \mu_i$. This is a consequence of the minimax
solution over the simplex, $R \to \infty$, being the same [8] (and
thus the solution for any superset of the aforementioned limit 
ball). Note that while the two-variable solution is the limit, any 
minimax pointwise solution — i.e., any of possibly multiple 
codes minimizing maximum redundancy, as is, e.g., Ell — 
suffices for optimization. 

III. Maximal Minimax Redundancy Coding 
Recall

$$\inf_{(\ell_1,\ldots,\ell_M)} \ \sup_{\{\nu :\, D(\nu\|\mu) \le R\}} \ \max_k \left( \ell_k + \log_D \nu_k \right) \tag{15}$$

as restricted in (5) and (6). This problem, closely related to
that of average minimax redundancy, is maximal minimax
(pointwise) redundancy, another robust coding measure
considered in [14] based on earlier work, e.g., [12], [24]. Pointwise
redundancy of item $k$ for a code with lengths $\ell$ given known
probability mass function $\nu$ is equal to $\ell_k + \log_D \nu_k$, which is
the difference between codeword length and self-information.
Self-information, equal to $-\log_D \nu_k$, is the optimal codeword
length of the coding problem where $\nu$ is known and there is no
integer restriction. Therefore minimax redundancy is average
minimax pointwise redundancy, or, over an arbitrary set of
probabilities $\mathcal{N} \subseteq \mathcal{M}(\mathcal{X})$ (not necessarily a relative entropy
ball),

$$\inf_{(\ell_1,\ldots,\ell_M)} \ \sup_{\nu \in \mathcal{N}} \ \sum_{k=1}^{M} \left( \ell_k + \log_D \nu_k \right) \nu_k$$

while maximal minimax pointwise redundancy is

$$\inf_{(\ell_1,\ldots,\ell_M)} \ \sup_{\nu \in \mathcal{N}} \ \max_k \left( \ell_k + \log_D \nu_k \right).$$

Since this is equal to

$$\inf_{(\ell_1,\ldots,\ell_M)} \ \max_{k \in [1,M]} \left( \ell_k + \log_D \sup_{\nu \in \mathcal{N}} \nu_k \right)$$
we can find the suprema and then calculate the solution to
the reduced problem appropriately, whether the robust Shannon
or robust Huffman solution is desired. The suprema
form what is called a normalized maximum-likelihood (NML)
distribution, $\underline{\pi}$, which is the normalized version of

$$\pi_k = \sup_{\nu \in \mathcal{N}} \nu_k,$$

that is,

$$\underline{\pi}_k = \frac{\sup_{\nu \in \mathcal{N}} \nu_k}{\sum_{i=1}^{M} \sup_{\nu \in \mathcal{N}} \nu_i}.$$

If $\mathcal{N}$ is the set of probability mass functions within a certain
total variation $T$ of a known $\mu$, this is trivial to compute as
$\pi_k = \min(1, \mu_k + T/2)$, and the solution can be found based
on this. In the case considered here, that of a relative entropy
ball, $\mathcal{N}$ is $\mathcal{M}_R$ (that is, $\{\nu : D(\nu\|\mu) \le R\}$). The normalization,
resulting in a constant difference in the minimized utility, is
optional for the purpose of building a robust Huffman code,
since the Huffman procedure (efficiently done in [21] for
sorted probabilities and [8] for unsorted probabilities) is
scale-invariant. The robust Shannon solution is based on the
optimal solution with the integer constraint removed; this
optimal solution is $-\log_D \underline{\pi}_k$ and thus the robust Shannon
code is $\ell^{MS}_k = \lceil -\log_D \underline{\pi}_k \rceil$.
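The total variation case mentioned above gives a compact illustration of the whole pipeline: compute the suprema, normalize, and round up. The sketch below follows the formulas in the text; the function names, the rounding guard, and the example values are assumptions of this sketch.

```python
import math

def total_variation_suprema(mu, T):
    """pi_k = min(1, mu_k + T/2), the per-symbol suprema stated in the text."""
    return [min(1.0, m + T / 2.0) for m in mu]

def normalize(pi):
    """Normalized maximum-likelihood distribution from the suprema."""
    total = sum(pi)
    return [p / total for p in pi]

def robust_shannon_lengths(pi, D=2):
    """ceil(-log_D) of the normalized suprema; the 1e-12 guard absorbs float error."""
    return [math.ceil(-math.log(p) / math.log(D) - 1e-12) for p in normalize(pi)]

mu = [0.5, 0.25, 0.25]                   # illustrative nominal distribution
pi = total_variation_suprema(mu, T=0.2)  # approximately [0.6, 0.35, 0.35]
lengths = robust_shannon_lengths(pi)
print(lengths)                           # [2, 2, 2]
# The lengths satisfy the Kraft inequality because the normalized values sum to 1.
assert sum(2.0 ** -l for l in lengths) <= 1.0
```

For a relative entropy ball, only `total_variation_suprema` would need to be replaced by a routine returning the suprema over the ball.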

Because finding the optimal solution follows from finding
the maximum likelihood distribution, the only unaddressed
issue is the following: Given probability mass function $\mu$, index
$k$, and relative entropy $R$, find the vector $\pi^{(k)}$
which maximizes $\pi^{(k)}_k$ within the constraint $D(\pi^{(k)}\|\mu) \le R$,
so that the non-normalized maximum likelihood distribution has
$\pi_k = \pi^{(k)}_k$.

First let us denote the deterministic distribution with all its
weight on item $k$ as $\omega^{(k)}$, the probability distribution such that
$\omega^{(k)}_k = 1$ (i.e., for $j \ne k$, $\omega^{(k)}_j = 0$). For each $k$, we should
first check whether $D(\omega^{(k)}\|\mu) \le R$. If so, clearly $\pi^{(k)} = \omega^{(k)}$.
This is the case if

$$D(\omega^{(k)}\|\mu) = 1 \cdot \log \frac{1}{\mu_k} + \sum_{i \ne k} \lim_{y \to 0} y \log \frac{y}{\mu_i} = -\log \mu_k \le R$$

which occurs if and only if $\mu_k \ge e^{-R}$.

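For those values, if any, not satisfying this, we take a
Lagrangian approach to this constrained optimization, reducing
this problem to several problems each of a single dimension
with roots that can be easily found. Specifically, if $s$ and $\lambda$
are the Lagrangian multipliers, and $1^{(k)}_i$ is a function that is 1
if $i = k$ and 0 otherwise, then to maximize

$$L_{\lambda,s}(\pi^{(k)}) = \pi^{(k)}_k - s \left( \sum_{i=1}^{M} \pi^{(k)}_i \log \frac{\pi^{(k)}_i}{\mu_i} - R \right) + \lambda \left( \sum_{i=1}^{M} \pi^{(k)}_i - 1 \right)$$

we require that

$$\frac{\partial L_{\lambda,s}(\pi^{(k)})}{\partial \pi^{(k)}_i} = 1^{(k)}_i - s \left( \log \frac{\pi^{(k)}_i}{\mu_i} + 1 \right) + \lambda = 0$$

which means that $\pi^{(k)}_i$ is proportional to $\mu_i$ for all $i \ne k$.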

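Thus, given probability mass function $\mu$, index $k$, and relative
entropy $R$, the maximizing $\pi^{(k)}$ has $\pi^{(k)}_i = \rho_k \mu_i$ for some $\rho_k$
on all $i \ne k$ (and thus $\pi^{(k)}_k = 1 - \rho_k (1 - \mu_k)$), so we actually
only need to solve binary divergence

$$R = D(\pi^{(k)}\|\mu) = \pi^{(k)}_k \log \frac{\pi^{(k)}_k}{\mu_k} + \sum_{i \ne k} \pi^{(k)}_i \log \frac{\pi^{(k)}_i}{\mu_i} = \pi^{(k)}_k \log \frac{\pi^{(k)}_k}{\mu_k} + \left( 1 - \pi^{(k)}_k \right) \log \frac{1 - \pi^{(k)}_k}{1 - \mu_k} = d(\pi^{(k)}_k \| \mu_k)$$

for the larger of the two possible solutions, where

$$d(p \| m) \triangleq p \log \frac{p}{m} + (1 - p) \log \frac{1 - p}{1 - m}.$$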
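The reduction to a one-dimensional root-finding problem can be sketched as follows. Because $d(p\|m)$ is increasing in $p$ on $[m, 1]$, the larger root can be bracketed between $\mu_k$ and 1 and found by bisection; the early return implements the check $\mu_k \ge e^{-R}$ from the text. The function names and the choice of bisection are assumptions of this sketch rather than the method prescribed in the paper.

```python
import math

def binary_divergence(p, m):
    """d(p || m) in nats for 0 <= p <= 1 and 0 < m < 1."""
    def term(x, y):
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return term(p, m) + term(1.0 - p, 1.0 - m)

def sup_probability(mu_k, radius, iters=200):
    """Largest probability of symbol k over the relative entropy ball."""
    if mu_k <= 0.0:
        return 0.0                       # absolute continuity forces zero mass
    if mu_k >= math.exp(-radius):
        return 1.0                       # the point mass on k lies inside the ball
    lo, hi = mu_k, 1.0                   # d = 0 at lo, d = -log(mu_k) > radius at hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_divergence(mid, mu_k) < radius:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = sup_probability(0.25, 0.05)          # illustrative values
print(p, binary_divergence(p, 0.25))     # second value is close to 0.05
print(sup_probability(0.9, 0.2))         # 1.0, since 0.9 >= exp(-0.2)
```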

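This solution lies in the range

$$\pi^{(k)}_k \in \left( \mu_k, \ \mu_k + \sqrt{R/2} \, \right] \cap (0, 1) \tag{16}$$

with the lower bound due to $d(\mu_k \| \mu_k) = 0$ and the non-trivial
upper bound derived from Pinsker's inequality, due to

$$\left\| \left[ \pi^{(k)}_k, \ 1 - \pi^{(k)}_k \right] - \left[ \mu_k, \ 1 - \mu_k \right] \right\|_1 \le \sqrt{2 \, d(\pi^{(k)}_k \| \mu_k)}.$$
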

For small $R$, a closed-form approximation can also be obtained
under the assumption that $\pi^{(k)}_k < 2\mu_k$. Taking terms up to the
second order for the power series of the logarithm, we find

$$R = \frac{\left( \pi^{(k)}_k - \mu_k \right)^2}{2 \mu_k (1 - \mu_k)} + O\left( R^{3/2} \right)$$

where we use $\pi^{(k)}_k - \mu_k = O(\sqrt{R})$, obtained using Pinsker's
inequality, to simplify the additive order term, so that

$$\pi^{(k)}_k = \mu_k + \sqrt{2R(1 - \mu_k)\mu_k} + O(R).$$

This can be used to find the (approximated) solution for each
$k$ or as a first guess in Newton's or Halley's method
to find solutions with arbitrary precision. Both these methods
converge quickly (quadratic and cubic convergence,
respectively), as the function is convex increasing and three times
continuously differentiable with finite non-zero derivatives and
no other zeroes over the first two derivatives in the solution
range (16). The solution $\pi$ can then be used to construct the
corresponding robust Shannon or robust Huffman (minimax
pointwise redundancy) code. Note that in this case, the robust
Huffman code will have the property that no length is longer
than that of the robust Shannon code; if it were, the utility
minimized would be at least 1, whereas the robust Shannon
code shows that the minimum should be less than 1. Thus,
even in a pointwise sense, the robust Huffman code is not
inferior to the robust Shannon code.
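The use of the small-$R$ approximation as a starting point for Newton's method can be sketched as below. The derivative used is $\frac{d}{dp} d(p\|m) = \log \frac{p(1-m)}{m(1-p)}$. The clamping steps that keep iterates inside $(\mu_k, 1)$, the tolerance, and the function names are assumptions of this sketch and are not taken from the paper.

```python
import math

def binary_divergence(p, m):
    """d(p || m) in nats for 0 < p < 1 and 0 < m < 1."""
    return p * math.log(p / m) + (1.0 - p) * math.log((1.0 - p) / (1.0 - m))

def newton_sup_probability(mu_k, radius, tol=1e-14, max_iter=50):
    """Solve d(p || mu_k) = radius for the larger root, starting from the approximation."""
    if mu_k <= 0.0:
        return 0.0
    if mu_k >= math.exp(-radius):
        return 1.0                                   # point mass on k is feasible
    # Second-order starting guess, kept strictly inside (mu_k, 1).
    p = mu_k + math.sqrt(2.0 * radius * mu_k * (1.0 - mu_k))
    p = min(p, 0.5 * (mu_k + 1.0))
    for _ in range(max_iter):
        f = binary_divergence(p, mu_k) - radius
        f_prime = math.log(p * (1.0 - mu_k) / (mu_k * (1.0 - p)))
        new_p = p - f / f_prime
        if new_p >= 1.0:                             # overshoot guard (assumption)
            new_p = 0.5 * (p + 1.0)
        if new_p <= mu_k:                            # undershoot guard (assumption)
            new_p = 0.5 * (p + mu_k)
        if abs(new_p - p) < tol:
            return new_p
        p = new_p
    return p

p = newton_sup_probability(0.25, 0.05)               # illustrative values
print(p, binary_divergence(p, 0.25))                 # second value is close to 0.05
```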